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Abstract 

This article describes our participation of the Gene Ontology Curation task (GO task) in 
BioCreative IV where we participated in both subtasks: A) identification of GO evidence sen- 
tences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for rele- 
vant genes in full-text articles. For subtask A, we trained a logistic regression model to 
detect GOES based on annotations in the training data supplemented with more noisy 
negatives from an external resource. Then, a greedy approach was applied to associate 
genes with sentences. For subtask B, we designed two types of systems: (!) search-based 
systems, which predict GO terms based on existing annotations for GOESs that are of differ- 
ent textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the- 
art information retrieval techniques (i.e., a novel application of the idea of distant supervi- 
sion) and (ii) a similarity-based system, which assigns GO terms based on the distance be- 
tween words in sentences and GO terms/synonyms. Our best performing system for sub- 
task A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed 
overlap match. Our best performing system for subtask B, a search-based system, achieves 
an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our 
search-based systems for subtask B significantly outperformed the similarity-based system. 
Database URL: https://github.com/noname2020/Bioc 



Introduction 

The Gene Ontology (GO) provides a set of concepts for an- 
notating functional descriptions of genes and proteins in 
biomedical literature. The resulting annotated databases 
are useful for large-scale analysis of gene products. 
However, performing GO annotation requires expertise 



from well-trained human curators. Owing to the fast 
expansion of biomedical data, GO annotation becomes 
extremely labor-intensive and costly. Thus, texting mining 
tools that can assist GO annotation and reduce human 
effort are highly desired (1-3). 
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To promote research and tool development for assisting 
GO curation from biomedical literature, the Critical 
Assessment of Information Extraction in Biology 
(BioCreative) IV organized Gene Ontology Curation task 
(GO task) in 2013 (4). There are two subtasks: A) identifi- 
cation of GO evidence sentences (GOESs) for relevant 
genes in full-text articles and B) prediction of GO terms for 
relevant genes in full-text articles. The training set of GO 
task contains 100 full-text journal articles [in BioC format 
(5)], while the development and test sets each have 50 art- 
icles. Task organizers also provided ground truth annota- 
tions for the training and development sets to all 
participants (5). Table 1 gives the detailed statistics about 
genes, gene-related passages and GO terms in the GO task 
data. 

The following shows two sample passages and the cor- 
responding key information in the training and develop- 
ment sets: 

Key information for sample passage 1: 

<infon key="gene">cdc-14(173945)</infon> 
<infon key=" go-term "> embryo development ending in 
birth or egg hatchingiGO:0009792</infon> 
<infon key="goevidence">IMP</infon> 
<text>However, of all components tested, only the deple- 
tion of the C. elegans homologue of the budding yeast 
Cdcl4p phosphatase caused embryonic lethality in the off- 
spring of injected worms (Table l).</text> 

Key information for sample passage 2: 

<infon key="gene">cdc-14(173945)</infon> 

<infon key="go-term">phosphatase activity|GO:0016 

791</infon> 

<infon key="goevidence">NONE</infon> 
<text>CeCDC-14 is a phosphatase and localizes to the 
central spindle and the midbody</text> 

Given a set of relevant genes, for subtask A, we need to 
find GOESs, while for subtask B, we need to assign GO 
terms to each article (primarily based on the gene-related 
sentences identified in subtask A). 

In this article, we describe our participation systems for 
the GO task. For subtask A, we trained a logistic regression 
(LR) model to detect GOESs using the training data sup- 
plemented with noisy negatives from an external resource. 
A greedy approach was applied to associate relevant genes 
with sentences. For subtask B, we designed two types of 
systems: (i) search-based systems, which predict GO terms 
based on existing annotations for GOESs that are of differ- 
ent textual granularities (i.e., full-text articles, abstracts, 
and sentences) using state-of-the-art information retrieval 
techniques and (ii) a similarity-based system, which assigns 
GO terms based on the distance between words in sen- 
tences and GO terms/synonyms. 



Table 1. Statistics of the data set for BioCreative IV Track 4 
GO Task 



GO task data 


Training 


Development 


Test 




set 


set 


set 


Number of full-text articles 


100 


50 


50 


Number of genes 


300 


171 


194 


Number of gene-associated 


2234 


1247 


1681 


passages 








Number of GO terms 


954 


575 


644 



In the following sections, we will first describe our sys- 
tems in more detail. Then, we will present and discuss the 
official evaluation results. Finally, we draw conclusion and 
point possible directions for future work. 

System description 

Subtask A 

In this subtask, given a full-text article, we need to identify 
GOESs and associate genes related to these sentences. The 
task can be defined as a supervised machine learning task 
by considering GOESs as positives and all other sentences 
as negatives. As positives and negatives are from the same 
pool of articles, the resultant models may be overfitted. We 
supplemented negatives with unlabeled excerpts from 
GeneRIF (6) records aiming for better models based on our 
prior experience on distant supervision, i.e. use existing re- 
sources to obtain weakly labeled instances for training ma- 
chine learning classifiers (7, 8). 

Data preprocessing 

We extract positive and negative instances (i.e. sentences) 
from both training and developing sets to train our model. 
The training set contains 1318 positive and 26 868 nega- 
tive instances, while the development set gives 558 positive 
and 14 580 negative sentences. 

We use GeneRIF as an unlabeled data pool, which con- 
tains excerpts from literature about the functional annota- 
tion of genes described in EntrezGene. In particular, each 
record contains a taxonomy ID, a Gene ID, a PMID and a 
GeneRIF text excerpt extracted from literature. We ran- 
domly obtain 20 000 excerpts from human GeneRIF re- 
cords with at most two records per Gene ID and the 
corresponding articles not associated with any GO annota- 
tions (GOA) record based on GOA information available 
in iProClass (9). We assume those excerpts have a higher 
chance to be negatives, assuming that if the excerpts are 
evidence excerpts, the corresponding article has a higher 
probability to be included in GOA. The rationale behind 
this assumption is that the scope of the functional 
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annotation in GeneRIF is broader than that of GO. Besides 
the scope of GO annotation, GeneRIF also includes pheno- 
typic and disease information that are not the subject of 
GO annotation. Note that this assumption does not guar- 
antee all excerpts obtained to be true negatives. 

Feature extractions 

Bag-of-word (BOW) features. For each sentence, we gener- 
ate a vector of stemmed words. 

Bigram features. For each sentence, we generate a vector of 
bigrams by concatenating every two neighboring stemmed 
words in the sentence. We also have two boundary bigrams 
(SOS_Lw and Rw_EOS) where SOS indicates 'Start of the 
Sentence', EOS indicates 'End of the Sentence', Lw, the 
leftmost stemmed word and Rw, the rightmost stemmed 
word. 

Section feature. For each sentence, we include a feature to 
indicate which section the sentence is from (i.e. title, ab- 
stract, introduction, methods, discussion, etc.). 

Topic features. These features are generated by Latent 
Dirichlet Allocations (10), which can effectively group 
similar instances together based on their topics (11-14). 

Fresence of relevant genes. Because relevant genes of each 
article have been provided, we also use dictionary lookup 
to check the presence of relevant genes in the sentence. 

Model training 

We apply LR to predict labels for each instance. In particu- 
lar, we impose a constraint on model parameters in a regu- 
larized LR to avoid overfitting and to improve the 
prediction performance on unseen instances. Note that LR 
assigns probability scores to each class. In a task with 
skewed class distribution, a threshold can be chosen to op- 
timize the performance. 

Assembling gene and evidence sentence pairs 

For each article, all relevant genes are provided. Therefore, 
we use a greedy approach to associate evidence excerpts 
with the relevant genes. The approach includes four steps: 

Step 1. Direct matching with dictionary lookup. Direct dic- 
tionary lookup is done for each predicted positive sentence 
to detect whether there are relevant genes appearing in the 
sentence. If so, the corresponding genes found are assigned 
to that sentence. 

Step 2. Family name inferred. Because genes belonging to 
the same family can appear as plurals in the document, we 



assemble a dictionary of family names based on the gene 
mentions provided. For each mention of the family name 
in a sentence (using direct string matching), all of the mem- 
bers of that family in the gene list are assigned to the 
sentence. 

Step 3. Gene assignment based on proximity. For the re- 
maining predicted positive sentences with no relevant gene 
mentioned, we assume that prior sentences would contain 
the gene information. For positive sentence S, we perform 
direct string matching using the gene list provided and the 
family name dictionary assembled in Step 2 on all prior 
sentences belonging to the same section of S. Gene hits are 
identified similarly as in Steps 1 and 2. We then assign 
gene hits from the closest one (among all prior sentences 
with gene hits) to S. 

Step 4. Assignment based on gene-sentence distributions. 
For genes failed to be associated with any predicted posi- 
tive sentence, we picked sentences containing the corres- 
ponding genes with the largest positive probability score 
(assigned by the LR model) to be the evidence sentences. 

Submissions for subtask A 

We used LR-TRIRLS (15), which implements ridge regres- 
sion, to build LR models. We chose a threshold of 0.1 
based on the performance of the model trained using the 
training set and evaluated using the development set, 
where if a sentence has a probability >0.1 to be positive, 
then we consider it as positive. We submitted three runs 
(Al, A2 and A3) for subtask A. Runs Al and A2 used dif- 
ferent sets of unlabeled instances sampled from GeneRIF, 
and Run A3 combined the results from Al and A2. 

Subtask B 

In this section, we describe two systems that generated the 
first two runs of Task B. The basic idea is to leverage exist- 
ing GOA to label new articles. In particular, we search for 
relevant documents (sentences, abstracts or full-text art- 
icles) that have existing GOA to the target article, and then 
score and aggregate these existing GOA to produce the 
GOA for the target article. 

System Bl 

Figure 1 gives an overview of System Bl. We highlight 
external resources in blue and system modules with gray. 
Next, we describe each part in detail. 

Resources. We use the following external resources: (i) 
Panther (16), from which we build <GeneID, GOSlimID> 
pairs; (ii) iProClass (9), from which we obtain 
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Figure 1. Overview of System B1. 



<GOSlimID, GOID, PMID> triplets; (iii) a collection of 
PMC full-text articles that serve as the source for finding 
relevant documents and (iv) a collection of PubMed ab- 
stracts, used as a complementary source for retrieving, be- 
cause for some GOA records, only abstracts are publically 
available for the corresponding articles. 

Retrieval. We build indexes for the abstract collection and 
the full-text collection, respectively, using the Indri (17) 
search engine. In particular, we use the Porter stemmer for 
stemming words in the documents. We choose the query 
likelihood language model as our retrieval model. This 
model scores documents for queries as a function of the 
probability that query terms would be sampled (independ- 
ently) from a bag containing all the words in that docu- 
ment. Formally, the scoring function is a sum of the 
logarithms of smoothed probabilities: 

score(D, Q) = logP(Q|D) = g log " 

where qi is the i^^ query term, |Dj and |C| are the document 
and collection lengths in words, respectively, tf^^^o and 
tfq.^c are the document and collection term frequencies of 
qi, respectively, and is the Dirichlet smoothing parameter. 
The Indri retrieval engine supports this model by default. 

Query formulation. We formulate a query for each de- 
tected GOES from the output of subtask A. In particular, 
we filter stop words in the sentence using a standard stop 
word list. We leverage information in <GeneID, GOSLIM, 



GO> triples to reduce the GO candidate list (denoted as 
C), and then build a PMID candidate list by incorporating 
information in the <PMID, GOA> pairs. The following 
lists the detailed steps: 

• Given a gene G, we have a list of <G, GOES> pairs. 

• For each <G, GOES> pair, we find the corresponding 
<G, GOShmID> pairs. 

• For each <G, GOSlimID> pair, we get a hst of PMIDs 
based on <GOShmID, GOID, PMID> triplets. 

• Combine all PMIDs for G to get a <G, L> pair, where L 
is the PMID candidate list (a reduced searching list) for G. 

Annotation. The output from the retrieval model for a 
given <GeneID, GOES> pair is a list of documents ranked 
by their relevance scores. Based on the <GOSlimID, 
GOID, PMID> triplets, we obtain GOIDs for top-ranked 
k documents, and then weight each GOID by their corres- 
ponding document relevance score. We further aggregate 
scores of each GOID and take the top-ranked m GOID for 
each GOES. Finally, we combine GOID across all GOES, 
rank them according to their occurrences and keep GOID, 
which occurs more than ptimes. For our submission, we set 
<k, m,p > to <7, 10, 4> by training them on the 150 art- 
icles (i.e. the combination of training and development sets). 

System B2 

Figure 2 gives an overview of System B2, which has similar 
modules to System Bl. The major difference is that we use 
GeneRIF (6) as the external resource. In particular, we ex- 
tract < Sentence, GOID> pairs from GeneRIF where the 
corresponding articles are cited as evidence of GOA records 
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Figure 2. Overview of System B2. 



in iProClass and built an index for this collection of sen- 
tences. Therefore, the output from the Retrieval model is a 
ranked list of sentences, which we further converted to a 
ranked list of GOID based on <Sentence, GOID> pairs. 
Finally, in the Annotation module, we did the following: 

1. Starting from an initial list that contains top-ranked k 
GOID, select GOID one by one down the list until the 
score difference of current GOID with the topmost 
GOID is above threshold h. 

1. Aggregate GOID frequency across all GOES associated 
with a particular gene, and rank GOID by frequency. 

3. Take the top-ranked m GOID for each gene. 

Submissions for subtask B. 

System Bl and System B2 were implemented in Indri (18). 
By training them on the 150 articles (i.e. the combination 
of training and development sets), we set < k, m, p > to 
<7, 10, 4> for System Bl and <k, h, m> to <5, 0.1, 3> 
for System B2. We submitted three runs (Bl, B2 and B3) 
where Run Bl is the output of System Bl and Run B2 is 
the output of System B2. Run B3 is the output of a string 
matching algorithm. Specifically, we obtained all words in 
the sentences that are aligned to GO terms and synonyms 
when ignoring lexical variations (System B3). We then 
computed the Jaccard distance (19) between those matched 
words with GO terms and synonyms. A threshold of 0.75 
was used for GO term assignment. 

Evaluation metrics 

Both subtasks are evaluated using the standard precision 
(P), recall (R) and Fl-measure (Fl) scores (4). However, 



there are two different criteria for determining a match be- 
tween a candidate sentence and the ground truth sentence: 
(i) exact match between sentence boundaries and (ii) par- 
tial overlapping. Subtask B is also evaluated by P, R and 
Fl based on two different matching criteria: flat or hier- 
archical. For the flat P, R and Fl, a match occurs when the 
predicted GO term is exactly the same as the gold stand- 
ard. For hierarchical P, R and Fl, a match occurs when the 
predicted GO term has a common ancestor with the 
ground truth GO term. 



Results and discussion 

Table 2 presents the official evaluation results of subtask 
A. Runs Al and A3 obtain comparable Fl scores. Run A2 
has a lower Fl score because of the relatively low perform- 
ance for recall. Note that the performance difference be- 
tween Runs Al and A2 was purely because of different 
noisy negatives sampled from GeneRIF. 

During the development phase of systems for subtask A, 
we assessed the performance with or without the use of 
additional GeneRIF excerpts and the contributions of indi- 
vidual types of features. We found that the use of an un- 
labeled data set sampled from GeneRIF improved the Fl 
score by 0.03 compared with the baseline, which uses only 
positives and negatives from the training data set and 
BOW features. Also, including other features (bigrams, 
gene existence, section and topic features) led to perform- 
ance improvement over the baseline. In particular, section 
feature improved the Fl score by 0.01. Bigram and gene 
presence features each brought an improvement of 0.008. 
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Table 2. Official evaluation results for subtask A using trad- 
itional P, R and F-Measure (F1) 



System 


Overlap match 




Exact match 






P 


R 


Fl 


P 


R 


Fl 


Al 
A2 
A3 


0.313 
0.314 
0.307 


0.503 
0.442 
0.524 


0.386 
0.367 
0.387 


0.219 
0.220 
0.214 


0.352 
0.310 
0.366 


0.270 
0.257 
0.270 


Both strict exact match and relaxed overlap 


measure are 


considered. 




Table 3. Official evaluation results for subtask B using trad- 
itional (flat) precision (P), recall (R) and Fl-measure (F1) and 
hierarchical precision (hP), recall (hR) and Fl-measure (hFI) 


System 


Flat 






Hierarchical 






P 


R 


Fl 


hP 


hR 


hFI 


Bl 
B2 

B3 


0.054 
0.088 
0.029 


0.149 
0.076 
0.039 


0.079 
0.082 
0.033 


0.243 
0.250 
0.196 


0.459 
0.263 
0.310 


0.318 
0.256 
0.240 



Topic features further added 0.003 when the number of 
topics was set to 100. 

Table 3 presents the official evaluation results of sub- 
task B. The exact Fl scores for both types of systems are 
<0.1. System Bl achieves 0.301 for Hierarchical-Fl. Our 
search-based systems (i.e. Bl and B2) outperformed the 
similarity-based systems (i.e. B3) significantly. 

We were not aware of the need of containing experimen- 
tal methods for detecting GO evidence excerpts and assign- 
ing GO terms as specified by the annotation guideline. This 
may explain why the use of section features in subtask A has 
the most gain in the Fl score. Additionally, we sampled 
only from human GeneRIF records with at most two re- 
cords per gene. The rationale behind it is to avoid overrepre- 
sentation of popular studied genes and their homologous 
genes. It is not clear whether such sampling approach has 
impact on the performance of the system. 

Note that the use of GOSlim in System Bl aims to re- 
duce the number of candidate GO terms for consideration. 
As subtask B depends on subtask A, it is not clear how well 
our search-based methods for subtask B can achieve giving 
the gold standard output from subtask A. Owing to the 
time constraint, we leave this interesting investigation as a 
potential future investigation. 

Conclusion and future work 

Through the participation of the GO task, we investigated 
the use of distant supervision for detecting sentences for 



GO annotation assignment and explored the use of infor- 
mation retrieval techniques for finding relevant existing 
GOA and used them for assigning GO terms to new 
articles. 

The results look promising compared with previous 
challenges. However, there is still much room for improve- 
ment. Specifically, we plan to explore advanced text mod- 
eling methods including deep learning (20-23) and 
hierarchical/supervised topic modeling (24-26) for the 
task. We can make use of unlabeled text for feature extrac- 
tions or build deep belief networks for sparse feature learn- 
ing. With enough GOA, we can explore the use of 
hierarchical/supervised topic modeling for predicting GOA 
given evidence sentences. 
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